1. INTRODUCTION

Vehicle make and model recognition (VMMR) systems are part of intelligent transportation systems
(ITS), a field that has seen significant advances over the last twenty years.
An intelligent transportation system integrates "smart" communication and information processing
technologies into the transportation system. Although it usually encompasses all modes of transport, it is
used to describe road, vehicle and driver interactions and aims to make them more dynamic, by enabling
better and faster decision making for both road users and supervisors (if any). The push for better ITS and the
massive investments made in it by countries like Japan or the USA naturally fostered innovation in
all its related domains. In addition, computational power, video-sensing and miniaturization
technologies have kept improving steadily since the early nineties, and even since the last decade, leading to
denser and more powerful devices along with more sophisticated and better-performing VMMR software. The most
widespread taxonomy of VMMR methods splits them into two broad categories, namely appearance based and
model based. Appearance based methods rely on photometric information and the features extracted from it
(edges, corners, gradients...), whereas model based methods seek to obtain a good geometrical
representation of the objects in the image, often relying on stereo vision and two-dimensional techniques for
primitive extraction. Here we use the front part of the car, which is considered to be the most
discriminative [1-11], but other parts have been used too, such as a combination of different parts [12], the
rear [13, 14], the logo [15] and even 3D modeling [16]. In this paper we build an appearance based VMMR
system on the most discriminative part of the car, located at the front, using various feature extraction and
classification techniques. We consider the VMMR problem in a bag of features framework that combines local
low-level features, sparse coding, dictionary learning and support vector machines (SVM). These local
low-level features are used to form a global descriptor, which can be seen as a mid-level feature as mentioned in [17],
and are also extensively used in deep neural networks [18, 19]. Sparse coding originally used fixed dictionaries
until Olshausen and Field [20, 21] used data to generate dictionaries representing an underlying
hidden structure of said data. Raina et al. [22] compare PCA and sparse coding, building a dictionary from
unlabeled images to obtain sparse representations of labeled images and feeding them to a support vector machine.
Yang et al. [23] build the dictionary from SIFT descriptor data instead of raw images and use a
one-against-all multiclass linear SVM. Zeiler et al. [24] build a dictionary by stacking deconvolution and max
pooling layers and use conjugate gradients to update the filters (K filters for each layer) of the hierarchical
dictionary; each deconvolution layer seeks to minimize the reconstruction error of an input image under a
sparsity penalty. In general the aforementioned methods use the alternating or sequential optimization
approach, which updates either the dictionary or the coding matrix while fixing the other. More modern
methods like Dir of Rakotomamonjy [25] jointly optimize the dictionary D and the coding matrix A, which
improves the algorithm runtime.


2. THE BAG OF FEATURES METHOD 

The general bag of features framework follows the six steps detailed below and shown in Figure 1.
Although the bag of features is supervised by nature (through the use of SVMs), it relies almost exclusively on an
unsupervised learning method to generate said features in its arguably most important step, i.e.
the dictionary generation or learning step. Dictionary learning can be seen as a matrix factorization
problem alongside other methods such as principal component analysis (PCA), clustering or vector
quantization [26, 27], non-negative matrix factorization (NMF) [28, 29], archetypal analysis [30],
or independent component analysis (ICA) [31-34].
Figure 1. Bag of features framework


2.1. Patch extraction 

To streamline the extraction of the relevant image patches, we use the algorithm of [35], as
illustrated in Figure 2. Simple and straightforward, it relies on the horizontal gradient, morphological
operations and connected component analysis to locate the license plate in the image, which in turn
enables us to obtain the front part of the vehicle for further processing. We then sample local areas of the
previously acquired image patches either densely [36, 37] or sparsely [38-40]. In our implementation we
use a dense grid of 2x2, 4x4, 8x8 or 16x16 patches over the image with no scaling, and because of the
tight segmentation, background clutter is kept to a strict minimum. However, as pointed out by Nowak et al. [41],
there are better sampling methods than dense sampling: many regions of the image (most of them, in fact) have
low discriminative power, which hurts performance.
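
The dense grid sampling described above can be illustrated with a minimal sketch. It assumes non-overlapping square patches and a grayscale image stored as a 2-D NumPy array; the function name `dense_patches` and the toy 8x8 image are hypothetical and are not taken from [35] or from our implementation.

```python
import numpy as np

def dense_patches(image, size):
    """Cut a 2-D image into non-overlapping size x size patches, one patch per row."""
    h, w = image.shape
    # Drop the border rows/columns that do not fill a whole patch.
    h, w = h - h % size, w - w % size
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    # Bring the two grid axes to the front, then flatten each patch.
    return blocks.transpose(0, 2, 1, 3).reshape(-1, size * size)

# Toy 8x8 image whose pixel values are simply their raster index.
image = np.arange(64, dtype=float).reshape(8, 8)
patches = dense_patches(image, 4)   # four 4x4 patches, each flattened to 16 values
```

Each row of `patches` is one flattened patch, so the array can be passed directly to a feature descriptor. Overlapping grids would need a stride parameter, which this sketch leaves out.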
Figure 2. Steps of the license plate extraction algorithm [35]


2.2. Feature description 

This process maps the input pixels from a sparse or dense sample (see Figure 3) into feature vectors.
The map is usually gradient based, like the Scale Invariant Feature Transform (SIFT) [39], Speeded-Up Robust
Features (SURF) [40], Histogram of Oriented Gradients (HOG) [37] and Laplacian of Gaussian (LoG) [42] descriptors,
or statistically based, like Principal Component Analysis (PCA) [43] or Linear Discriminant Analysis (LDA)
[44, 45]. Here we use the Square Mapped Gradients (SMG) [1], which is a gradient based feature descriptor:
Figure 3. Sparse and dense sampling methods: (a) Sparse sampling, (b) Dense sampling 


2.3. Dictionary generation 

This step is crucial to the overall performance of the system, as it computes codewords from the
feature descriptors of the second step; these codewords then form the dictionary, as shown in Figure 4. The
standard way of generating codewords (i.e., the dictionary) consists of simply clustering the set of feature vectors
(using K-Means) [46]; however, other unsupervised [46, 47, 17] and supervised [48] methods are used as well.
Dictionary learning in this paper follows the algorithm described in [49], in which, given a finite set of feature
vectors $X = [x_1,\dots,x_n] \in \mathbb{R}^{m \times n}$, we optimize the following problem, also known as the Lasso [50]:

$$\min_{\alpha} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1, \qquad (2)$$

where $D \in \mathbb{R}^{m \times k}$ is an overcomplete dictionary ($k > m$, $k$ being the number of codewords and $m$ the
dimension of the feature vectors), $\alpha = [\alpha_1,\dots,\alpha_n] \in \mathbb{R}^{k \times n}$ the coding matrix, $\lambda$ the regularization
parameter and $\|\alpha\|_1$ the $\ell_1$ Lasso penalty which induces sparsity [49] in the coding matrix $\alpha$.
Figure 4. Top row, from left to right: 2x2, 4x4, 8x8 and 16x16 codewords; bottom row: their respective
121-codeword dictionaries


2.4. Feature coding 

With the dictionary and feature vectors in hand, we can obtain the coding matrix, which generally
describes the relationship between the feature vectors and the codewords (dictionary): each feature vector
activates a number of codewords, thus forming a coding vector with either binary elements (hard vector
quantization) or continuous elements (soft vector quantization, sparse coding) [51]. In our setup we use what [51] calls
reconstruction based coding, which uses a subset of the codewords to describe the features by solving a least-squares
optimization problem, here the Lasso (2). In fact, both the dictionary and the coding matrix are obtained during
the optimization step.


2.5. Feature pooling

The coding vectors obtained previously are pooled over the whole image to form one pooling vector,
which is the final representation of the input image. Of the two popular pooling strategies used in the
literature, namely max pooling and average pooling, max pooling has been shown to be superior [17]. Given
$\alpha = [\alpha_1,\dots,\alpha_n] \in \mathbb{R}^{k \times n}$ the coding matrix and $i = [1,\dots,N] \in \mathcal{N}_m$ the indices of the feature locations in image
$m = 1,\dots,M$, we obtain the following max pooled vector:

$$z_{m,j} = \max_{i \in \mathcal{N}_m} \alpha_{i,j} \quad \text{for } j = 1,\dots,k, \qquad (3)$$

where $z_m \in \mathbb{R}^k$ is the vector representing the whole image $m$.
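
As an illustration of the pooling step (3), the following sketch pools the columns of a small coding matrix image by image. The function name `max_pool`, the array `image_index` (which maps each coding vector to its image) and the toy values are assumptions made for the example only.

```python
import numpy as np

def max_pool(alpha, image_index):
    """Pool the coding vectors (columns of alpha) into one k-vector per image."""
    images = np.unique(image_index)
    # Row m of the result is the element-wise maximum over the columns of image m.
    return np.stack([alpha[:, image_index == m].max(axis=1) for m in images])

# k = 3 codewords, n = 4 coding vectors; the first two belong to image 0, the rest to image 1.
alpha = np.array([[0.0, 0.5, 0.0, 0.2],
                  [0.3, 0.0, 0.0, 0.9],
                  [0.0, 0.1, 0.7, 0.0]])
image_index = np.array([0, 0, 1, 1])
z = max_pool(alpha, image_index)    # shape (2, 3): one pooled vector per image
```

Replacing `.max(axis=1)` with `.mean(axis=1)` would give average pooling, the alternative strategy mentioned above.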


3. ONLINE DICTIONARY LEARNING 

The algorithm described in this section, Algorithm 1, assumes a training set composed of
independent and identically distributed samples and alternates at each iteration between computing the
decomposition $\alpha_t$ of the training sample $x_t$ over the dictionary $D_{t-1}$ obtained during the previous iteration
and updating the dictionary $D_t$ by minimizing the function in (5).


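Algorithm 1 Online dictionary learning.

Precondition: $x \in \mathbb{R}^m \sim p(x)$ (random variable and an algorithm to draw i.i.d. samples of $p$), $\lambda \in \mathbb{R}$ (regularization
parameter), $D_0 \in \mathbb{R}^{m \times k}$ (initial dictionary), $T$ (number of iterations)

1: $A_0 \leftarrow 0$ (reset the "past" information)
2: $B_0 \leftarrow 0$
3: for $t = 1$ to $T$ do
4:    Draw $x_t$ from $p(x)$
5:    Sparse coding: compute using LARS

      $$\alpha_t = \arg\min_{\alpha \in \mathbb{R}^k} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1. \qquad (4)$$

6:    $A_t \leftarrow A_{t-1} + \alpha_t \alpha_t^T$
7:    $B_t \leftarrow B_{t-1} + x_t \alpha_t^T$
8:    Compute $D_t$ using Algorithm 2, with $D_{t-1}$ as warm restart, so that

      $$D_t = \arg\min_{D} \frac{1}{t}\sum_{i=1}^{t} \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \qquad (5)$$
      $$\phantom{D_t} = \arg\min_{D} \frac{1}{t}\left(\frac{1}{2}\mathrm{Tr}(D^T D A_t) - \mathrm{Tr}(D^T B_t)\right).$$

9: end for
10: return $D_T$ (learned dictionary)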


3.1. Sparse coding 

The sparse coding problem in (2) with a fixed dictionary is an $\ell_1$-regularized linear least-squares
problem. When the dictionary columns have low correlation, simple methods based on coordinate
descent with soft thresholding [52, 53] are enough. But the columns of the dictionary are more often than
not highly correlated, and it has been shown that a Cholesky-based implementation of the LARS algorithm [54, 55],
which provides the entire regularization path, can be as fast as simpler soft thresholding based methods while
being more accurate.
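
To make the simpler alternative concrete, the sketch below solves the Lasso for a single feature vector by cyclic coordinate descent with soft thresholding. It is not the LARS solver used in this paper; the function names, the fixed number of sweeps and the toy orthonormal dictionary are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Scalar soft-thresholding operator."""
    return np.sign(v) * max(abs(v) - lam, 0.0)

def lasso_cd(x, D, lam, n_sweeps=100):
    """Minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 by cyclic coordinate descent."""
    k = D.shape[1]
    a = np.zeros(k)
    residual = x.astype(float).copy()      # equals x - D a, with a = 0 at the start
    col_norms = (D ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(k):
            if col_norms[j] == 0.0:
                continue                    # a zero column cannot explain anything
            # Correlation of column j with the residual that excludes its own contribution.
            rho = D[:, j] @ residual + col_norms[j] * a[j]
            new = soft_threshold(rho, lam) / col_norms[j]
            residual += D[:, j] * (a[j] - new)
            a[j] = new
    return a

# With an orthonormal dictionary the solution is the soft-thresholded input.
D = np.eye(3)
x = np.array([3.0, -0.5, 1.5])
a = lasso_cd(x, D, lam=1.0)
```

With weakly correlated columns a few sweeps are usually enough. With strongly correlated columns convergence slows down, which is the situation in which the text above prefers LARS.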
Algorithm 2 Dictionary Update.

Precondition: $D = [d_1,\dots,d_k] \in \mathbb{R}^{m \times k}$ (input dictionary)

1: repeat
2:    for $j = 1$ to $k$ do
3:       Update the j-th column to optimize for (6)
4:       $u_j \leftarrow \frac{1}{A_{jj}}(b_j - D a_j) + d_j$
5:       $d_j \leftarrow \frac{1}{\max(\|u_j\|_2, 1)} u_j$
6:    end for
7: until convergence
8: return $D$ (updated dictionary)

3.2. Dictionary update 

The dictionary update algorithm uses block coordinate descent with warm restarts. It does not require
any learning rate tuning and is parameter-free, and since the coding vectors are sparse, the coefficients of A are
concentrated on the diagonal, which makes the block coordinate descent more efficient. In practice, Algorithm 2 updates each
j-th column of D sequentially and (6) gives the solution of the dictionary update with respect to the j-th
column, while keeping the other ones fixed, under the constraint $d_j^T d_j \leq 1$. It has been shown that this
procedure converges to a global optimum [56]. This algorithm uses a warm restart, but other
approaches have been proposed to update D; for instance, the authors of [57] suggest using a Newton method on
the dual of (5).
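
Algorithms 1 and 2 can be sketched together in a few lines. The sketch makes several assumptions that are not part of the method described here: the sparse coding step uses ISTA (iterative soft thresholding) as a stand-in for LARS, a fixed number of sweeps replaces the "until convergence" test, codewords that have never been activated are left unchanged, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def sparse_code(x, D, lam, n_iter=200):
    """ISTA for 0.5 * ||x - D a||_2^2 + lam * ||a||_1 (stand-in for LARS)."""
    step = 1.0 / max(np.linalg.norm(D, 2) ** 2, 1e-12)   # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))                # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return a

def update_dictionary(D, A, B, n_sweeps=5):
    """Algorithm 2 sketch: column-wise block coordinate descent."""
    D = D.copy()
    for _ in range(n_sweeps):
        for j in range(D.shape[1]):
            if A[j, j] < 1e-12:       # codeword never activated so far: keep it as is
                continue
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)     # enforces d_j^T d_j <= 1
    return D

def online_dictionary_learning(samples, D0, lam):
    """Algorithm 1 sketch: one sparse coding step and one dictionary update per sample."""
    D = D0.copy()
    m, k = D.shape
    A, B = np.zeros((k, k)), np.zeros((m, k))
    for x in samples:
        alpha = sparse_code(x, D, lam)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        D = update_dictionary(D, A, B)    # warm restart from the previous dictionary
    return D

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 8))       # 200 synthetic feature vectors, m = 8
D0 = rng.normal(size=(8, 12))             # k = 12 > m: overcomplete dictionary
D0 /= np.linalg.norm(D0, axis=0)          # start from unit-norm codewords
D = online_dictionary_learning(samples, D0, lam=0.1)
```

Only the matrices A and B carry information about past samples, so the memory cost does not grow with the number of samples drawn.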


4. THE VEHICLE LICENSE PLATE LOCATION ALGORITHM 
4.1. Horizontal gradient 

It is known that the most discriminative parts of a vehicle, mainly its front and its rear, are composed of
horizontal lines [58], while the license plate region has a predominance of vertical lines. We apply the Sobel
operator to obtain the vertical lines [59]. Figure 5(b) shows the result of horizontal gradient detection
on the original image in Figure 5(a). To emphasize the license plate region more clearly, exploiting its great
concentration of high valued pixels, we apply a mean filter [60] to the image. The final stage of the horizontal
gradient phase can be seen in Figure 5(c).
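
The horizontal gradient phase can be sketched with plain NumPy. The 3x3 Sobel kernel is standard, but the 5x5 mean filter size, the edge padding and the helper names are assumptions for this example, not values taken from [35].

```python
import numpy as np

def filter2d(image, kernel):
    """Correlate a 2-D image with a small odd-sized kernel (edge padding, same size)."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Sobel kernel for the horizontal derivative: it responds to vertical edges.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def horizontal_gradient(image, mean_size=5):
    """Return the absolute horizontal gradient and its mean-filtered version."""
    grad = np.abs(filter2d(image, SOBEL_X))
    mean_kernel = np.full((mean_size, mean_size), 1.0 / mean_size ** 2)
    return grad, filter2d(grad, mean_kernel)

# Toy image with a single vertical edge between columns 3 and 4.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
grad, smoothed = horizontal_gradient(image)
```

The same function applied to `image.T`, which contains only a horizontal edge, returns a gradient of zero everywhere. This is the property that separates the plate's vertical strokes from the horizontal lines of the vehicle front.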


4.2. Filtering 

The goal of this phase is to darken every non license plate region. Morphological operations proposed
in [61, 62] are applied to the image to darken high valued regions that do not fit the expected size. Small
non-VLP salient regions are darkened by a morphological opening operation, as shown in Figure 5(d). The joint
application of the mean filter and the opening operation, however, causes undesirable variation among the pixel values of the license
plate region, perceived as gaps between its vertical saliences. We restore smoothness in these values,
which amounts to filling the artificially created gaps, using a morphological closing operation. Big non
license plate regions are also darkened by top-hat filtering, so that big salient regions have their pixel
values lowered, as shown in Figure 5(e). As a result, the license plate region may exhibit unwanted artifacts, in that
the space between the letters and the numbers may be less salient. The binarization process may therefore
perform poorly, splitting the license plate region into multiple parts; that problem can be fixed by applying a
closing operation to the image. We then remove the saliences on the license plate's borders that may appear
during the horizontal gradient step, to find the tightest boundary of the license plate's characters. We finally
close the filtering phase by applying an erosion operation followed by a dilation operation.
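
The morphological operations used in this phase can be written compactly with rectangular structuring elements. The sketch below is a generic illustration: the structuring element sizes, the edge padding and the toy image are assumptions and do not reproduce the settings of [61, 62].

```python
import numpy as np

def _windows(image, size):
    # Edge padding keeps the output the same size as the input (odd sizes assumed).
    pad_h, pad_w = size[0] // 2, size[1] // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    return np.lib.stride_tricks.sliding_window_view(padded, size)

def erode(image, size):
    return _windows(image, size).min(axis=(2, 3))

def dilate(image, size):
    return _windows(image, size).max(axis=(2, 3))

def opening(image, size):
    """Erosion then dilation: removes bright regions smaller than the element."""
    return dilate(erode(image, size), size)

def closing(image, size):
    """Dilation then erosion: fills dark gaps smaller than the element."""
    return erode(dilate(image, size), size)

def top_hat(image, size):
    """Image minus its opening: keeps only what the opening removed."""
    return image - opening(image, size)

# Toy image: a one-pixel bright blob and a 3x7 plate-like bright region.
image = np.zeros((9, 15))
image[1, 2] = 1.0
image[3:6, 4:11] = 1.0
opened = opening(image, (3, 5))        # the blob disappears, the plate survives
small_only = top_hat(image, (3, 5))    # only the blob remains

# A one-column gap inside the plate, as left between two characters.
gapped = image.copy()
gapped[3:6, 7] = 0.0
closed = closing(gapped, (1, 3))       # the gap is filled
```

Whether an operation darkens small or big regions depends only on the size of the structuring element relative to the expected plate size.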


4.3. Adjustment

The adjustment phase generates potential license plate regions in a binary image. The first step is to
binarize the image to separate salient regions from the background, using Otsu's method [63] to automatically
define the binarization threshold and minimize the chance of a miss, as depicted in Figure 5(g). A problem
that could arise is that the contrast between dark non license plate regions and the brighter
license plate region may not be big enough, which could result in a "botched" binarization. In the connected component
analysis we remove any region that is not license plate shaped from the binary image, and finally we keep
only the bounding boxes of the license plate region candidates. We also eliminate candidates that intersect
each other, because the license plate region is essentially composed of vertical edges that do not intersect each
other. Other candidates dominated by vertical edges, such as signs that contain letters, are not close enough for their
regions to intersect; on the other hand, candidates originating from background noise have a greater chance of touching
each other given their random location. For the remaining candidates, we first cut the initial corresponding
bounding box out of the monochrome image of the filtering phase, as shown in Figure 5(f), and consider it as
a separate candidate image. Second, a new binarization takes place with Otsu's method and the candidate's
boundary is expanded to include lower valued pixels by a dilation operation. The resulting binary image may
still contain undesired regions, which are erased by an erosion operation followed by a dilation. Finally, the
resulting candidate bounding box is checked against minimum dimension constraints; if they are not satisfied, the
candidate is discarded unless it is the last one, in which case it is kept. If there is no candidate or all of them
are discarded, a histogram equalization is performed on the original image to improve the contrast, and
the location process is repeated from its very beginning.
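
Otsu's method, used twice in this phase, chooses the threshold that maximizes the between-class variance of the gray-level histogram. The following is a minimal sketch; the number of bins, the choice of the upper bin edge as the threshold and the synthetic bimodal data are assumptions for illustration.

```python
import numpy as np

def otsu_threshold(image, n_bins=256):
    """Return the gray level that maximizes the between-class variance."""
    hist, edges = np.histogram(image, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weight_low = np.cumsum(hist)                  # pixels in bins 0..b
    weight_high = weight_low[-1] - weight_low     # pixels in bins b+1..end
    sum_low = np.cumsum(hist * centers)
    mean_low = sum_low / np.maximum(weight_low, 1)
    mean_high = (sum_low[-1] - sum_low) / np.maximum(weight_high, 1)
    between = weight_low * weight_high * (mean_low - mean_high) ** 2
    # Upper edge of the best bin, so that every pixel in bins 0..b lies below the threshold.
    return edges[np.argmax(between) + 1]

# Synthetic bimodal data: a dark background and a bright plate-like region.
dark = np.linspace(40.0, 60.0, 500)
bright = np.linspace(190.0, 210.0, 500)
pixels = np.concatenate([dark, bright])
threshold = otsu_threshold(pixels)
binary = pixels >= threshold            # True for the bright (salient) pixels
```

When the two modes are close, the between-class variance has no sharp maximum, which corresponds to the weak contrast case mentioned above.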


Figure 5. Steps of the vehicle license plate location method: (a) Original image, (b) Horizontal gradient, 
(c) Mean filter on horizontal gradient, (d) Small saliences darkened, (e) Big saliences darkened, (f) Erosion 
and dilation operations, (g) Binarization, (h) Finding the vehicle license plate 


5. CLASSIFICATION METHODS 
5.1. Multi-class support vector machines 

Multi-class classification is achieved here by using a "one against one" scheme with probability
estimates [64-66], thus training k(k-1)/2 classifiers for k classes, one for each pair of classes. Given k classes,
we estimate the pairwise class probabilities:

$$r_{ij} \approx P(y = i \mid y = i \text{ or } j, x) = \frac{1}{1 + e^{Af + B}}, \qquad (7)$$

where $f$ is the decision value at $x$ and $A$, $B$ are estimated by minimizing the negative log likelihood of the training data.
With all the $r_{ij}$ we can obtain the $p_i$ by optimizing the following objective function:

$$\arg\min_{p} \frac{1}{2}\sum_{i=1}^{k}\sum_{j: j \neq i} (r_{ji}p_i - r_{ij}p_j)^2, \qquad (8)$$

$$\text{subject to } p_i \geq 0, \ \forall i, \quad \sum_{i=1}^{k} p_i = 1.$$

Any $x$ is then assigned to the class with the highest $p_i$.
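
To illustrate (7) and (8), the sketch below maps a decision value to a pairwise probability and then recovers the class probabilities from the matrix of pairwise estimates. It solves only the equality-constrained quadratic program through its KKT linear system and does not enforce the non-negativity constraint explicitly, which is an assumption of this example. The function names and the toy probabilities are hypothetical, and this is not the code of the library used in our experiments.

```python
import numpy as np

def pairwise_sigmoid(f, A, B):
    """Equation (7): map an SVM decision value f to a pairwise probability."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def couple_probabilities(r):
    """Solve (8) for p, where r[i, j] estimates P(y = i | y = i or j, x)."""
    k = r.shape[0]
    Q = -r.T * r                                   # off-diagonal entries: -r_ji * r_ij
    off_diag = ~np.eye(k, dtype=bool)
    np.fill_diagonal(Q, ((r ** 2) * off_diag).sum(axis=0))   # diagonal: sum over s != i of r_si^2
    # KKT system of the quadratic program with the constraint sum(p) = 1.
    kkt = np.zeros((k + 1, k + 1))
    kkt[:k, :k] = Q
    kkt[:k, k] = kkt[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    return np.linalg.solve(kkt, rhs)[:k]

# Pairwise estimates that are exactly consistent with p = (0.5, 0.3, 0.2).
p_true = np.array([0.5, 0.3, 0.2])
r = p_true[:, None] / (p_true[:, None] + p_true[None, :])
p = couple_probabilities(r)
predicted_class = int(np.argmax(p))
```

With consistent pairwise estimates the objective in (8) reaches zero at the underlying probabilities, so the sketch recovers them. With noisy estimates it returns the least-squares compromise.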


5.2. Supervised K-means 
The k-means [67] algorithm is one of the simplest clustering algorithms. Given a set of data x, it aims
to separate it into k different clusters by optimizing the following objective function:

$$\min \sum_{i=1}^{k}\sum_{j=1}^{n} \|x_j^{(i)} - c_i\|^2, \qquad (9)$$

where $c_i$ is the center of cluster $i$ and $x_j^{(i)}$ is a data point assigned to it. The standard implementation uses
Lloyd's algorithm [68], which consists of the following steps:
(a) Initial step: where the centroids are initialized.
(b) Assignment step: where each data point is assigned to a specific cluster based on its distance to the
cluster's centroid.
(c) Update step: where we compute the new centroids for each cluster.
In our implementation, we skip the update step, so the centroids are initialized once, using only the training data.
Then each test point is assigned to a specific cluster using its Euclidean distance to the respective centroid.
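
A minimal sketch of this classifier follows. It assumes that each centroid is initialized as the mean of the training vectors of one class, which the description above does not specify; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_centroids(X_train, y_train):
    """Initial step only: one centroid per class, computed from the training data."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X_test, classes, centroids):
    """Assignment step: nearest centroid under the (squared) Euclidean distance."""
    distances = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[distances.argmin(axis=1)]

# Toy data: two classes in the plane.
X_train = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X_train, y_train)
labels = predict(np.array([[1.0, 1.0], [9.0, 11.0]]), classes, centroids)
```

Because the update step is skipped, the test data never influence the centroids.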


6. EXPERIMENTAL RESULTS 

Using the COMVis car dataset [7], we build our set by automatically clipping from each image the
most discriminative region of interest for make and model recognition, the front part of the car, using the
algorithm of [35] for license plate detection. In our experiment we use five car makes, Suzuki, Toyota, Mitsubishi,
Hyundai and Honda, with respectively 153, 78, 36, 37 and 97 images. Note that these images contain different
models, 8 for Suzuki, 4 for Toyota, 2 each for Mitsubishi and Hyundai and 5 for Honda, thus increasing the overall
difficulty of the recognition task due to unbalanced datasets. We build our feature matrix by concatenating the
feature vectors (2x2, 4x4, 8x8, 16x16 blocks) of each image and then optimize the Lasso (2) using [49] and
the least angle regression (LARS) [69] algorithms to obtain the dictionary and the coding matrix. The pooling
strategy (3) applied to the coding matrix gives us a representative vector for each image. These vectors are
then fed to a linear kernel support vector machine (SVM) in a cross-validation framework in which 80% of the data
is used for training; this process is repeated one hundred times to obtain the recognition rates of Tables 1 and 2.
The results here are obtained using the square mapped gradient descriptor (1) applied to a 128x512 image, thus
obtaining a 128x1024 matrix, which is a concatenation of the image under the SMG descriptor along
direction x and the image under the SMG descriptor along y. This new image is considered the input of the
feature matrix generation step. In Table 1, we use supervised k-means instead of the SVM classifier, which
achieves performance similar to that of the SMG+SVM setup without dictionary learning, whose accuracy is
85.01±2.68, compared to 80.66±2.78 for the SMG+KMeans setup without dictionary learning.


Table 1. Accuracy (%) of the Kmeans+SMG setup for different dictionary and feature vector sizes.
nb of     size of features
words     2x2           4x4           8x8           16x16
1         03.75±1.06    05.66±2.03    06.75±3.04    08.98±2.50
2         42.87±4.35    54.46±4.55    76.49±4.15    79.81±5.20
4         44.14±4.55    61.73±5.01    76.83±5.46    84.15±5.30
8         45.66±4.11    58.70±5.92    72.65±4.98    87.30±3.28
16        46.17±6.15    55.84±6.28    68.36±3.93    87.19±2.52


Table 2. Accuracy (%) of the SVM+SMG setup for different dictionary and feature vector sizes.
nb of     size of features
words     2x2           4x4           8x8           16x16
1         03.45±2.06    04.23±1.28    04.18±2.50    08.64±2.54
2         46.46±5.83    60.81±5.45    78.60±4.59    84.38±4.46
4         48.07±4.50    66.42±4.51    80.09±4.59    86.38±3.53
8         50.76±5.17    65.62±4.82    77.52±4.77    89.82±3.27
16        51.03±5.09    65.56±4.86    75.40±3.23    89.99±2.45


7. CONCLUSION

Despite the improvement in performance, the optimization cost can be quite heavy, especially in an
online framework. Rigamonti et al. [70] showed that enforcing sparsity is not helpful for recognition rates when
extracting the features and that the same performance can be achieved through the use of plain convolution;
however, sparsity is still important when learning the filters. They used a different equation than (2), in which
the matrix-vector product is replaced by a convolution, and a two step algorithm in which learning the filters
and extracting the features are separated. The problem remains that sparsity does not dramatically increase
performance, nor is it required to obtain good results. Even then, and to put sparsity in perspective, it must
be inserted in a bigger framework, such as hierarchical models or deep belief networks (DBNs) [71], to show its
usefulness.